So, we were talking about another machine learning problem, which is linear regression
and classification.
A completely different problem, but with the same setup and just a slightly different application.
The idea is that we learned from PAC learning last time that it may be a good idea to have small, learnable hypothesis spaces.
We're taking this to its radical conclusion, basically saying: if we have a time series that does something, we just assume we can find a trend, i.e. a linear function that approximates it best.
How are we going to do that?
Now, what we did yesterday was, in a way, the first training exercise.
Let's do it for a simple case.
Univariate linear regression.
It becomes a little bit more interesting if we actually have higher dimensions.
Multivariate regression.
Namely, we now have linear functions from vectors to vectors, or maybe to a single real value.
And then this becomes a bit more useful.
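To fix notation, here is a minimal Python sketch of the two hypothesis shapes; the function names and the use of NumPy are illustrative choices, not something from the lecture.

```python
import numpy as np

def h_univariate(x, w0, w1):
    """Univariate case: a single scalar input, h_w(x) = w0 + w1 * x."""
    return w0 + w1 * x

def h_multivariate(x, w, b):
    """Multivariate case: a feature vector x, h_w(x) = w . x + b,
    mapping to a single real value here."""
    return float(np.dot(w, x) + b)
```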
But in the easy case, things are very simple.
And the poster child is that you have some kind of x-y plot, like house prices versus the size of the house somewhere.
We have some data points and we find a trend in this.
How do we compute this?
The idea is very, very simple.
What we do is loss minimisation, just as we have prepared it.
And in some cases, if the loss function is differentiable, we can basically do what we did in high school: set all the partial derivatives to zero.
And in some cases, namely the linear case with quadratic distance loss, we can even get closed-form solutions.
How nice.
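As a concrete sketch of that closed-form route for the univariate, squared-loss case: setting the two partial derivatives to zero and solving gives the usual slope and intercept formulas. The house-price numbers below are made up purely for illustration.

```python
import numpy as np

# Made-up toy data: house size in square metres vs. price.
x = np.array([50.0, 70.0, 80.0, 100.0, 120.0])
y = np.array([150.0, 200.0, 230.0, 280.0, 330.0])

# Set the partial derivatives of sum_i (y_i - (w0 + w1 * x_i))^2
# with respect to w0 and w1 to zero and solve:
x_mean, y_mean = x.mean(), y.mean()
w1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
w0 = y_mean - w1 * x_mean

print(f"fitted trend: price = {w0:.1f} + {w1:.2f} * size")
```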
If we have more interesting loss functions, we can still do something: then we do gradient descent.
And the nice thing about this particular case is that we very often have convex weight spaces.
And in convex spaces, like a distorted sphere, we don't have local maxima or minima other than the global ones.
And so we're in the good case, which actually means that hill climbing, which is what the search algorithm behind gradient descent really is, is guaranteed to find the global optimum.
We don't have to do things like simulated annealing, where we do random restarts with funny temperature schedules and settling phases.
But we can just basically start anywhere, follow the steepest descent, just like a ball rolling downhill would, and find the global minimum.
So that, in a way, is the good case.
And we can actually use the gradient descent update, which we can compute via the partial derivatives of the loss function, and just search for the minimum with a local search algorithm.
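A minimal sketch of that update loop for the univariate squared loss, assuming a fixed step size alpha and roughly unit-scaled inputs; the parameter values and the toy data are illustrative, not the lecture's.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, steps=10_000):
    """Fit y ~ w0 + w1 * x by steepest descent on the mean squared loss.
    alpha is the step size; it assumes roughly unit-scaled inputs and
    otherwise has to be tuned."""
    w0, w1 = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        err = (w0 + w1 * x) - y               # residuals of the current line
        grad_w0 = 2.0 * err.sum() / n         # partial derivative w.r.t. w0
        grad_w1 = 2.0 * (err * x).sum() / n   # partial derivative w.r.t. w1
        w0 -= alpha * grad_w0                 # step against the gradient
        w1 -= alpha * grad_w1
    return w0, w1

# Hypothetical usage, with house sizes rescaled to hundreds of square metres:
x = np.array([0.5, 0.7, 0.8, 1.0, 1.2])
y = np.array([150.0, 200.0, 230.0, 280.0, 330.0])
print(gradient_descent(x, y))
```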
And there are various refinements; you can imagine that, since linear regression is something you use all the time, there has been a lot of engineering.
You can do it one step at a time, or with multiple searchers, which would be something like beam search, or with multiple steps at a time, at different step sizes, letting the step sizes decay, and so on.
All of those things are practically important.
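For instance, a decaying step size can be sketched as a simple schedule that is queried at every iteration of the loop above; the 1/(1 + decay * t) form and the hyperparameter names are just one illustrative choice.

```python
def decayed_step_size(alpha0: float, t: int, decay: float = 0.01) -> float:
    """One common family of schedules: large early steps to make progress,
    smaller late steps to settle near the minimum.
    alpha0 and decay are hyperparameters to tune."""
    return alpha0 / (1.0 + decay * t)
```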